The debugging cycle is the most common methodology for finding and correcting errors in sequential
programs.  Cyclic  debugging  is  effective  because  sequential  programs  are  usually  deterministic.  Debugging 
parallel  programs  is  considerably  more  difficult  because  successive  executions  of  the  same  program  often  do 
not  produce  the  same  results.  In  this  paper  we  present  a  general  solution  for  reproducing  the  execution 
behavior  of  parallel  programs,  termed  Instant  Replay.  During  program  execution  we  save  the  relative  order 
of  significant  events  as  they  occur,  not  the  data  associated  with  such  events.  As  a  result,  our  approach 
requires  less  time  and  space  to  save  the  information  needed  for  program  replay  than  other  methods.  Our 
technique  is  not  dependent  on  any  particular  form  of  interprocess  communication.  It  provides  for  replay  of 
an  entire  program,  rather  than  individual  processes  in  isolation.  No  centralized  bottlenecks  are  introduced 
and  there  is  no  need  for  synchronized  clocks  or  a  globally-consistent  logical  time.  We  describe  a  prototype 
implementation  of  Instant  Replay  on  the  BBN  Butterfly  Parallel  Processor,  and  discuss  how  it  can  be 
incorporated into the debugging cycle for parallel programs.


1.  Introduction 

Debugging  sequential  programs  is  a  well  understood  task  that  draws  on  tools  and  techniques 
developed  over  many  years.  One  early  technique  was  to  record  snapshots  of  the  entire  program  state,  often 
consisting  of  many  pages  of  hexadecimal  digits,  for  perusal  by  the  programmer.  Debugging  was  a 
programmer-intensive  operation,  since  there  were  few  tools  for  analyzing  the  program  state.  Over  time  this 
approach was replaced by interactive debuggers, which allow the programmer to examine arbitrary parts
of  the  program  state  during  execution.  Debugging  became  more  computation-intensive,  since  the  computer 
was  used  to  reproduce  execution  sequences  with  successively  greater  detail.  As  a  result,  the  most  common 
methodology  used  today  to  debug  sequential  programs  is  cyclic:  the  program  is  executed  until  an  error 
manifests  itself,  the  programmer  postulates  a  set  of  underlying  causes  for  the  error,  trace  statements  or 
additional  breakpoints  are  inserted  to  gather  more  information  about  the  causes  of  the  error,  and  the 
program is reexecuted. This technique is effective because sequential programs are usually deterministic.
That is, for a fixed input, each execution of a program will always follow the same execution path and
produce  the  same  results. 

Debugging  parallel  programs  is  considerably  more  difficult  because  parallel  programs  are  often  not 
deterministic.1  In  our  model  parallel  programs  consist  of  multiple  asynchronous  processes  that  communicate 
using  some  form  of  message-passing  or  shared  memory.  No  assumption  may  be  made  about  the  relative 
speed  of  processes:  we  can  only  assume  finite  progress  by  each  process.  Since  parallel  programs  do  not  fully 
specify  all  possible  execution  sequences,  the  execution  behavior  of  a  parallel  program  in  response  to  a  fixed 
input  may  be  indeterminate,  with  the  results  depending  on  a  particular  resolution  of  race  conditions 
existing  among  processes.  Therefore,  cyclic  debugging  techniques  for  error  isolation  are  not  guaranteed  to 
work because successive executions of the same parallel program may not produce the same results. We
are  left  with  two  options  for  debugging  parallel  programs:  we  can  either  take  snapshots  of  the  program 
state  during  execution  for  later  examination  or  we  can  provide  a  mechanism  that  guarantees  reproducible 
behavior  of  parallel  programs.  Only  the  latter  approach  allows  reliable  use  of  cyclic  debugging  techniques. 
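
To make the source of indeterminacy concrete, the following minimal Python sketch (an illustration, not part of the system described here) enumerates every interleaving of two processes that each perform an unsynchronized read followed by a write of the incremented value to a shared counter. The names `interleavings` and `run` are hypothetical. The same input yields different results depending on how the race is resolved.

```python
from itertools import combinations

# Each process performs two steps on a shared counter:
# step 1 reads the counter, step 2 writes back (value read + 1).
STEPS_PER_PROCESS = 2

def interleavings(n_a, n_b):
    """Yield every schedule (tuple of 'A'/'B') that keeps each process's own order."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        yield tuple('A' if i in a_slots else 'B' for i in range(total))

def run(schedule):
    """Execute one schedule and return the final value of the shared counter."""
    shared = 0
    local = {'A': None, 'B': None}        # value each process read
    has_read = {'A': False, 'B': False}
    for p in schedule:
        if not has_read[p]:
            local[p] = shared             # step 1: read
            has_read[p] = True
        else:
            shared = local[p] + 1         # step 2: write
    return shared

results = {s: run(s) for s in interleavings(STEPS_PER_PROCESS, STEPS_PER_PROCESS)}
outcomes = sorted(set(results.values()))  # distinct final values over all schedules
```

Only the two schedules in which one process finishes before the other starts leave the counter at 2; every schedule in which the reads overlap loses an update. A cyclic debugger that reruns the program cannot know which schedule it will get.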

The  first  alternative,  in  which  the  programmer  analyzes  snapshots  of  program  state  taken  during 
execution, recognizes that multiple executions of parallel programs are indeterminate; therefore, all
information necessary to diagnose program errors must be collected during a single execution. Behavioral
Abstraction (BA) is typical of this approach [2]. BA provides a mechanism for the hierarchical definition of
events  in  terms  of  sequences  of  primitive  events  that  can  occur  during  program  execution.  An  event 
recognition  tool  monitors  the  stream  of  primitive  events  that  occur  during  program  execution  and  presents 
the  user  with  an  abstract  view  of  the  program's  behavior  in  terms  of  a  sequence  of  hierarchically-defined 
events.  There  are  two  disadvantages  to  this  technique.  First,  BA  requires  that  a  user  exhaustively  describe 
interesting  events  which  take  place  during  execution  in  terms  of  a  bottom-up  specification.  In  creating  the 
specification,  the  user  must  anticipate  all  interesting  events  related  to  an  error  before  execution;  there  is  no 
mechanism  for  gathering  additional  information  about  an  error  after  it  is  observed.  Second,  the  amount  of 

1 We are interested in programs that exhibit true parallelism or, at the very least, appear to exhibit parallelism due to preemptive
scheduling of processes. A concurrent program implemented by coroutines running on a single processor without the possibility of
preemption can be debugged as if it were a sequential program.


information  gathered  tends  to  be  voluminous.  Since  the  technique  is  not  cyclic,  the  user  must  collect 
enough information during execution to diagnose any error that might arise. Other work based on one-shot
execution of parallel programs has the same limitation [1, 9, 21].

The second alternative for debugging parallel programs is based on reproducible program execution,
which allows cyclic debugging techniques to be applied. Reproducible program behavior has been studied
in several contexts, including concurrent programs using semaphores and monitors for communication,
systems  based  on  nested  atomic  transactions,  and  systems  comprised  of  loosely-coupled  processes  that 
communicate  via  messages. 

Carver  and  Tai  have  considered  repeatable  execution  for  programs  consisting  of  concurrent  processes 
that interact through semaphores and monitors [3]. In their approach, execution of a concurrent program is
characterized  by  a  sequence  of  P  operations  (termed  a  P-sequence)  on  shared  semaphores.  The  same  idea 
can  be  used  to  produce  an  M-sequence  for  monitors,  which  records  a  series  of  starts  of  monitor  procedures. 
A  P-sequence  is  a  sequence  of  ordered  pairs;  each  pair  corresponds  to  a  P  operation  on  a  specific 
semaphore  by  a  specific  process.  Thus,  a  P-sequence  is  a  total  order  of  all  synchronization  operations  that 
occur  in  a  program.  P-sequences  can  be  created  by  the  programmer  to  test  specific  synchronization 
sequences  of  a  concurrent  program  or  can  be  reproduced  during  execution  to  provide  repeatable  execution. 
The  disadvantage  of  this  approach  is  that  it  requires  that  all  P  operations  be  serialized,  thereby  losing  much 
of  the  potential  for  parallelism  that  exists  in  a  program.  While  adequate  for  single  processor  systems  that 
simulate concurrency, this technique would not be useful in a truly parallel environment. There, the
serialization  constraint  could  have  such  an  impact  on  program  performance  that  it  would  be  impractical  to 
monitor programs during normal execution. Use of this method would then be relegated to a distinct
testing  and  debugging  phase. 

Chiu's  technique  for  replaying  a  program’s  execution  in  an  atomic  transaction  system  involves 
checkpointing  each  version  of  all  atomic  objects  and  recording  a  timestamp  for  each  atomic  action  during 
program execution [5]. A debugger uses this information to traverse action trees (corresponding to the
nested  atomic  actions  of  a  program  execution)  according  to  a  serialization  of  their  constituent  atomic 
actions.  Traversing  an  action  tree  permits  viewing  the  state  of  atomic  objects  before  and  after  each  atomic 
update,  as  well  as  replaying  execution  through  action  sequences  to  isolate  program  flaws.  The  major 
drawback  of  this  work  is  that  the  techniques  are  restricted  to  computations  structured  in  terms  of  nested 
atomic  actions.  In  addition,  these  techniques  require  significant  storage  overhead  to  maintain  the  necessary 
checkpoints  of  atomic  objects,  although  the  checkpoints  may  be  required  for  recovery  actions  anyway. 

Methods that reproduce the execution behavior of programs comprised of loosely-coupled processes
that communicate using messages typically require that the contents of each message be recorded in an
event log as it is received p. 13, 24]. The programmer can either review the events (messages) in the log in
an attempt to isolate errors, or the events can be used as input to replay the execution of a process in
isolation. A similar event logging approach has also been used to monitor program performance [16].
There are several disadvantages to this approach. First, it has only been used in loosely-coupled systems
and there are reasons to believe it would not be well-suited to tightly-coupled systems. Although the
amount  of  data  exchanged  in  messages  could  be  very  large,  this  technique  exploits  the  fact  that 


communication in loosely-coupled systems takes place infrequently, primarily because of the high cost of
communication.  The  additional  time  necessary  to  copy  a  message  into  an  event  log  in  local  memory  does 
not  seriously  affect  performance  when  compared  with  the  time  required  to  send  a  message.  This 
assumption  does  not  necessarily  apply  to  tightly-coupled  systems,  where  the  cost  of  communication  is 
lower,  allowing  more  frequent  communication.  Another  disadvantage  is  that  the  space  requirements  for  the 
event  log  tend  to  be  very  large.  Again,  within  the  domain  of  loosely-coupled  processes,  it  is  reasonable  to 
assume  the  logs  will  grow  slowly  enough  that  they  can  be  buffered  in  memory  and  then  stored  on  external 
devices  without  seriously  affecting  the  performance  of  the  program.  The  third,  and  most  important 
drawback,  is  that  it  is  difficult  to  examine  the  global  effects  of  process  interactions  using  this  technique, 
since  the  replay  mechanism  only  operates  on  a  single  process  in  isolation.  Previous  attempts  to  replay 
groups  of  processes  using  this  scheme  require  that  a  network-wide  consistent  time  be  maintained  [7]. 

In  this  paper  we  present  a  general  solution  for  reproducing  the  execution  behavior  of  parallel 
programs,  termed  Instant  Replay.  Our  emphasis  is  on  providing  repeatable  execution  of  highly  parallel 
programs  in  tightly-coupled  systems,  although  our  approach  naturally  extends  to  loosely-coupled  systems. 
During  program  execution  we  save  the  relative  order  of  significant  events  as  they  occur,  not  the  data 
associated with such events. Since we do not require the contents of all process interactions (e.g., messages)
to  be  saved,  our  approach  requires  less  time  and  space  to  save  the  information  needed  for  program  replay 
than  other  methods.  Our  technique  guarantees  reproducible  program  behavior  during  the  debugging  cycle 
by using the same input from the external environment and by imposing the same relative order on events
during replay that occurred during the original execution. Unlike previous techniques, Instant Replay is
not  dependent  on  the  particular  form  of  interprocess  communication  used.  In  addition,  we  provide  replay 
for  an  entire  program,  rather  than  individual  processes  in  isolation.  Finally,  we  avoid  introducing  any 
global  synchronization  of  events  through  the  use  of  a  fully  distributed  protocol;  there  is  no  centralized 
bottleneck and no need for synchronized clocks or a globally-consistent logical time. With these properties,
Instant  Replay  is  especially  useful  for  debugging  parallel  programs  on  tightly-coupled  multiprocessors, 
where  interprocess  communication  is  cheaper,  and  therefore  more  frequent,  than  in  loosely-coupled 
systems. 

In  the  next  section  we  present  Instant  Replay,  including  our  goals,  assumptions,  and  approach. 
Section 3 describes a prototype implementation on the BBN Butterfly, a tightly-coupled multiprocessor
comprised  of  128  MC68000  processors.  In  Section  4  we  discuss  how  Instant  Replay  can  be  incorporated 
into  the  debugging  cycle  for  parallel  programs.  Section  5  summarizes  the  advantages  of  our  approach  and 
describes  our  plans  for  future  work. 

2.  Instant  Replay 

When  debugging  a  sequential  program,  one  can  usually  guarantee  reproducible  program  execution  by 
supplying  the  same  input  each  time  the  program  is  executed.  Successive  executions  with  the  same  input 
produce  the  same  behavior  because  sequential  programs  tend  to  be  deterministic.  The  same  is  true  of  the 
individual processes in a parallel program.2 If each process is supplied the same input values (corresponding
to  the  contents  of  messages  received  or  the  values  of  shared  memory  locations  referenced)  in  the  same  order 
during  successive  executions,  it  will  produce  the  same  behavior  each  time.  In  particular,  each  process  will 
produce  the  same  output  values  in  the  same  order.  Each  of  those  output  values  may  then  serve  as  an  input 
value  for  some  other  process.  Therefore,  in  order  to  debug  a  parallel  program,  we  do  not  need  to  store  all 
input  values  for  each  process  in  an  event  log,  since  any  input  value  corresponding  to  some  output  value  can 
be  recomputed  during  replay.  By  ensuring  that  each  process  sees  the  same  input  values  at  every  step  of 
execution,  all  processes  will  exhibit  the  same  execution  behavior  during  both  the  monitoring  phase  and 
replay.  Instant  Replay  is  based  on  this  observation. 

In  our  approach,  all  interactions  between  processes  are  modeled  as  operations  on  shared  objects.  A 
series  of  modifications  to  a  shared  object  is  represented  as  a  totally  ordered  sequence  of  versions.  Each 
version  has  a  corresponding  version  number,  which  is  unique  with  respect  to  a  particular  object.  During 
normal program execution (i.e., the monitoring phase) we record a partial order of the accesses to each
object. (It is a partial order because we do not need to impose an ordering on the processes that read a
particular version of a shared object.) This partial order is specified by a sequence of version numbers for
each object. To record the partial order, the system maintains the current version number for each object
and the number of readers for each version of each object. In addition, each process records the version
number  of  each  shared  object  it  accesses.  During  program  replay,  we  allow  each  process  to  recompute  its 
output  values,  thereby  providing  input  values  for  other  processes.  We  use  the  record  of  object  accesses 
recorded  by  each  process  to  ensure  that  the  same  version  of  input  values  used  by  the  process  during  the 
monitoring  phase  is  used  during  replay.  As  long  as  the  recorded  information  is  available,  the  original 
program  execution  can  be  repeated  over  and  over. 
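
The idea of recording version numbers rather than values can be sketched in a few lines of Python. This is a single-threaded simulation under stated assumptions, not the implementation described in this paper: one deterministic writer and one reader share a single object, a schedule is a string of "W" and "R" characters naming which process attempts its next step, and repeating the schedule round-robin stands in for a delayed process retrying. The function `execute` and the layout of the log are hypothetical.

```python
def execute(schedule, recorded=None):
    """Run one writer and one reader over a shared object.

    With recorded=None this is the monitoring phase; otherwise `recorded`
    is a log from an earlier run and the same partial order is enforced.
    Returns (values seen by the reader, log of version numbers and counts).
    """
    obj = {"value": 0, "version": 0}
    reads_of = {0: 0}             # object history: version -> number of reads
    writer_values = [10, 20]      # the deterministic writer recomputes these
    reads_wanted = 2
    seen = []                     # reader's observed values (its behavior)
    read_versions = []            # reader's process history: versions only
    written = 0
    replay = recorded is not None

    while written < len(writer_values) or len(seen) < reads_wanted:
        progressed = False
        for p in schedule:
            if p == "W" and written < len(writer_values):
                v = obj["version"]
                if replay and reads_of[v] < recorded["reads_of"][v]:
                    continue      # delay: some reader of this version is still due
                obj["value"] = writer_values[written]
                obj["version"] = v + 1
                reads_of[v + 1] = 0
                written += 1
                progressed = True
            elif p == "R" and len(seen) < reads_wanted:
                if replay and obj["version"] != recorded["read_versions"][len(seen)]:
                    continue      # delay: not the version read originally
                seen.append(obj["value"])
                read_versions.append(obj["version"])
                reads_of[obj["version"]] += 1
                progressed = True
        if not progressed:
            raise RuntimeError("no process can make progress")
    return seen, {"read_versions": read_versions, "reads_of": reads_of}

original, log = execute("RWRW")                 # monitoring phase
other, _ = execute("WWRR")                      # a different race resolution
replayed, _ = execute("WWRR", recorded=log)     # replay under that other schedule
```

The log holds only version numbers and reader counts. The values 0 and 10 that the reader saw in the original run never appear in it; they are recomputed by the writer during replay, and the reader is delayed until the version it originally read is current.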

Our  goal  is  to  provide  a  flexible  monitoring  system,  applicable  in  both  loosely-coupled  and  tightly- 
coupled  environments,  that  allows  a  programmer  to  replay  arbitrary  execution  sequences  produced  by  a 
parallel  program.  Since  we  cannot  predict  when  it  may  be  desirable  to  replay  a  particular  execution 
sequence,  it  must  be  practical  for  the  monitoring  mechanisms  to  be  in  place  during  every  execution. 
Therefore,  our  mechanisms  must  have  minimal  impact  on  program  performance.  Instant  Replay  provides 
reproducible  behavior  of  parallel  programs  with  minimal  impact  on  performance  by  (a)  simulating  the 
original  external  environment  during  replay,  (b)  modeling  all  interprocess  events  as  operations  on  shared 
data,  subsuming  both  shared  memory  and  message-passing  primitives,  (c)  recording  only  the  version 
number  of  data  values  that  are  input  to  each  process,  not  the  values  themselves,  thereby  minimizing  the 
amount  of  information  recorded,  and  (d)  using  a  distributed  data  collection  mechanism,  so  that  no  central 
bottleneck  is  present  when  a  program  is  being  monitored  or  replayed.  We  will  explore  each  of  these 
aspects  in  the  following  sections. 


2 For now, we assume that processes do not contain nondeterministic statements. In particular, processes do not allow
asynchronous interrupts.


2.1.  Simulating  the  External  Environment 

As  with  any  cyclic  debugging  system,  we  assume  that  the  original  execution  of  a  program  and 
subsequent  replays  occur  in  equivalent  virtual  machine  environments.  Two  virtual  machines  A  and  B  are 
said  to  be  equivalent  with  respect  to  program  P  if  program  P  can  exhibit  the  same  behavior  whether 
executed  on  virtual  machine  A  or  B.  For  practical  reasons,  we  do  not  require  equivalent  physical  machine 
states, since that would include the contents of all external devices, the exact value of the clock, and the
internal states of all components. In particular, A and B need not have identical real-time clock values if
P's  execution  does  not  depend  on  the  real-time  clock.  Similarly,  the  contents  of  file  F  on  machine  A  and  B 
can  differ  if  P  does  not  attempt  to  reference  F.  If  program  P  depends  on  physical  details  of  its  virtual 
machine  during  execution,  it  becomes  difficult,  if  not  impossible,  to  simulate  the  virtual  machine  during 
replay. 

Real-time  programs,  in  particular,  may  not  be  good  candidates  for  Instant  Replay  because  it  is  so 
difficult  to  simulate  equivalent  virtual  machines.3  We  require  that  programs  receive  identical  input  from  the 
environment  during  both  execution  and  replay.  However,  it  is  not  sufficient  simply  to  supply  the  same 
input to the process; we must also supply it at the same points during program execution. This can be very
difficult  for  real-time  programs  since  they  often  receive  input  as  a  result  of  asynchronous  interrupts. 
Without  making  special  provisions  to  record  when  interrupts  occur  during  program  execution,  which  could 
severely degrade performance, we cannot accurately simulate the original virtual machine environment.

It  is  important  to  note  that  the  problem  of  finding  equivalent  virtual  machines  also  arises  when 
debugging  sequential  programs;  it  is  orthogonal  to  the  specific  problem  of  debugging  parallel  programs. 
We  do  not  depend  on  a  particular  simulation  of  virtual  machines,  so  any  techniques  developed  for 
sequential  program  debugging  can  probably  be  used.  Specifically,  we  assume  that  programs  do  not  exploit 
the physical characteristics of any resources allocated by the system; therefore, we need only ensure that the
amount  of  resources  available  during  replay  is  at  least  the  amount  used  by  the  program  during  the  original 
execution.  Any  unsuccessful  attempt  to  allocate  resources  during  execution  can  be  recorded,  so  that  the 
same  behavior  can  be  recreated  during  replay. 

2.2.  Communication  Through  Shared  Objects 

If  processes  in  a  parallel  program  do  not  communicate,  each  process  can  be  debugged  using 
traditional  techniques,  since  other  processes  in  the  program  would  have  no  effect  on  the  execution  path  of  a 
particular  process.  It  is  only  when  processes  interact,  via  communication  and  synchronization  primitives, 
that  the  potential  for  nonrepeatable  behavior  arises.  Examples  of  process  interactions  include  P  and  V 
primitives applied to a shared semaphore [8], monitor entry procedures [10], send/receive message
primitives,  and  general  sharing  of  memory.  Instant  Replay  models  all  process  interactions  in  a  parallel 
program  as  operations  on  shared  data.  This  characterization  of  process  interactions  is  not  restrictive  since  all 
communication  and  synchronization  primitives  can  be  reduced  to  operations  on  shared  data.  In  particular, 
message-passing  can  be  modeled  as  communication  through  a  shared  port,  mailbox,  or  memory  segment. 

3 To our knowledge, no significant software debugging system exists for real-time programs.
Our approach exploits the fact that values exchanged between processes via shared data depend only
on  the  initial  values  in  shared  objects,  the  order  in  which  processes  are  granted  access  to  shared  objects,  and 
the  deterministic  nature  of  processes.  Operations  on  shared  data  objects  can  be  separated  into  two  classes: 
read  operations,  which  do  not  change  the  state  of  an  object,  and  write  operations,  which  do.  By  recording 
the  sequence  of  write  operations  on  each  shared  object,  it  is  possible  to  recreate  the  proper  sequence  of 
state  transitions  for  all  shared  objects  during  program  replay.  Similarly,  by  recording  the  version  number 
of  each  shared  object  read  by  a  process,  it  is  possible  to  recreate  the  proper  input  values  for  that  process 
during  replay.  This  is  exactly  the  information  we  record  during  the  monitoring  phase. 

Instant Replay requires that the set of operations on each shared object have a valid serialization. A
set  of  operations  has  a  valid  serialization  if  the  result  of  each  individual  operation  is  the  same  as  it  would 
be  if  the  operations  had  all  been  executed  in  some  sequential  order.  A  protocol  that  ensures  a  valid 
serialization,  such  as  a  concurrent-read-exclusive-write  (CREW)  protocol  [6],  must  be  used  for  access  to 
each  shared  object.  In  choosing  a  protocol,  we  look  for  one  that  guarantees  serializability,  while  exerting 
minimal  impact  on  shared  object  access  and  allowing  maximal  parallelism.  If  an  access  protocol  that 
guarantees  serializability  for  operations  on  shared  objects  is  already  present  in  the  application  or  the  system, 
it  is  not  necessary  to  superimpose  another.  Therefore,  our  techniques  are  applicable  to  programs  that 
incorporate  results  of  current  research  efforts  on  how  to  structure  interprocess  communication  to  admit  the 
most parallelism. For example, Lamport [12], Peterson [20], and Vitanyi & Awerbuch [25] present
algorithmic  solutions  for  the  concurrent-reading-while-writing  (CRWW)  problem  that  permit  concurrency 
among  readers  and  writers,  as  well  as  among  writers  themselves.  Instrumentation  for  Instant  Replay  can  be 
added  to  systems  that  use  such  protocols,  if  a  serialization  order  of  operations  on  each  shared  object  can  be 
determined.4

For  the  remainder  of  this  paper,  we  will  illustrate  our  technique  using  a  CREW  protocol  for  access  to 
shared  objects.  A  CREW  protocol  ensures  a  total  order  of  writers  with  respect  to  each  shared  object,  a 
total  order  of  readers  with  respect  to  writers  of  each  shared  object,  and  a  partial  order  of  readers  with 
respect to each shared object. Although we could use a protocol that requires mutually-exclusive (ME)
access  to  shared  objects,  resulting  in  a  total  order  on  accesses  to  each  object,  many  parallel  programs  allow 
concurrent  readers;  an  exclusive  access  protocol  would  artificially  limit  the  parallelism  in  such  programs. 
Since  the  execution  path  of  a  program  can  be  characterized  by  a  partial  order  on  the  operations  with 
respect  to  each  shared  object,  our  approach  is  to  record  a  partial  order  of  the  accesses  to  each  shared  object 
during  the  monitoring  phase  and  enforce  that  partial  order  during  replay. 

In addition to being independent of a particular protocol, Instant Replay does not rely on a particular
granularity of interprocess communication. The granularity of access to shared objects is
implementation-dependent. Message-passing systems only require the protocol during shared buffer access; shared-memory
systems  may  require  the  protocol  to  be  used  whenever  shared  storage  is  referenced. 

4 For a serialization of operations on a shared object to be possible, the object must be regular [12]. An object is regular when
all reads not concurrent with a write get correct values, and any read which overlaps a series of writes obtains either the value of the
object before the first of the writes, or one of the values being written.
2.3. Data Structures for Program Monitoring

During the monitoring phase, we use a set of 1-way infinite tapes, called history tapes, to record the
information necessary for program replay. There are two types of history tape: process history tapes are
used to record the version number of each shared object accessed by a process, and object history tapes are
used to record the number of readers that have accessed each version of a shared object. (Object history
tapes are used to record read operations that are effectively concurrent; they are unnecessary if a mutual
exclusion protocol is used.) A process history tape is modified only by the corresponding process; an object
history tape is modified by each process that accesses the object. Since the relevant information is recorded
as part of the access to an object, the monitoring phase imitates whatever parallelism is exhibited by the
application.

Each history tape has a header containing several fields: a pointer to the current square on the tape, a
pointer to the last non-blank square on the tape, and a pointer to the initial square on the tape. The
operations  that  can  be  applied  to  a  history  tape  are: 

RewindHistoryTape -- set the current square pointer to the initial square
AdvanceHistoryTape -- advance the current square pointer to the next square
ReadHistoryTape -- read the value written in the current square
WriteHistoryTape -- write a value in the current square
IncHistoryTape -- increment the value stored in the current square
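
The header fields and the five operations listed above can be sketched as a small Python class. This is a minimal illustration, not the prototype's data structure: the class name `HistoryTape`, the use of a growable list for the infinite tape, and the choice to treat a blank square as 0 when incrementing are all assumptions of the sketch.

```python
class HistoryTape:
    """A 1-way infinite tape of squares, modelled as a growable list."""

    BLANK = None

    def __init__(self):
        self.squares = [self.BLANK]
        self.initial = 0     # header: pointer to the initial square
        self.current = 0     # header: pointer to the current square
        self.last = -1       # header: last non-blank square (-1 while blank)

    def rewind(self):        # RewindHistoryTape
        self.current = self.initial

    def advance(self):       # AdvanceHistoryTape
        self.current += 1
        if self.current == len(self.squares):
            self.squares.append(self.BLANK)   # extend the tape on demand

    def read(self):          # ReadHistoryTape
        return self.squares[self.current]

    def write(self, value):  # WriteHistoryTape
        self.squares[self.current] = value
        self.last = max(self.last, self.current)

    def inc(self):           # IncHistoryTape
        old = self.squares[self.current]
        # Assumption: a blank square counts as 0 before the first increment.
        self.write((0 if old is self.BLANK else old) + 1)


tape = HistoryTape()
tape.write(3)        # e.g. a version number on a process history tape
tape.advance()
tape.inc()           # e.g. a reader count on an object history tape
tape.inc()
tape.rewind()
first = tape.read()
tape.advance()
second = tape.read()
```

After the calls shown, `first` holds the value written to the initial square and `second` holds the count accumulated by the two increments.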

Upon  creation,  each  shared  object  is  assigned  a  version  number  of  0  and  a  history  tape  that  is  initially 
blank.  Likewise,  upon  creation,  each  process  is  assigned  a  history  tape  that  is  initially  blank.  During  each 
read  or  write  operation  on  a  shared  object  by  a  process,  information  about  the  object  is  recorded  on  the 
process's history tape and information about the operation is recorded on the object history tape. All
history tapes created during the execution of a parallel program are linked together to form a tree. Each
time  a  process  spawns  a  child,  a  pointer  to  the  history  tape  of  the  child  process  is  recorded  on  the  history 
tape  of  the  parent.  Similarly,  each  time  a  process  creates  a  shared  object,  a  pointer  to  the  history  tape  of 
the  shared  object  is  recorded  on  the  history  tape  of  the  parent.  Thus,  during  program  execution,  a  tree  of 
history tapes is built which models the creation of objects during execution. This organization of history
tapes  enables  each  process  (object)  history  tape  to  be  associated  with  the  correct  process  (object)  during 
execution  replay. 
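
One plausible way to picture the tree of history tapes is sketched below in Python. The names `Tape`, `spawn`, `create_object`, and `walk` are hypothetical, and a plain list stands in for the tape squares. The sketch only shows the linking: each creation event leaves a pointer on the creator's tape, so the tapes can later be revisited in creation order.

```python
class Tape:
    """A history tape reduced to an owner name and a list of squares."""

    def __init__(self, owner):
        self.owner = owner
        self.squares = []

    def record(self, entry):
        self.squares.append(entry)


def spawn(parent_tape, child_name):
    """Create a child process tape and record a pointer to it on the parent's tape."""
    child = Tape(child_name)
    parent_tape.record(("child", child))
    return child


def create_object(creator_tape, object_name):
    """Create an object tape and record a pointer to it on the creator's tape."""
    obj_tape = Tape(object_name)
    creator_tape.record(("object", obj_tape))
    return obj_tape


def walk(tape, depth=0):
    """Yield (depth, owner) for every tape reachable from `tape`, in creation order."""
    yield depth, tape.owner
    for entry in tape.squares:
        if isinstance(entry, tuple) and entry[0] in ("child", "object"):
            yield from walk(entry[1], depth + 1)


root = Tape("main")
create_object(root, "queue")
worker = spawn(root, "worker")
create_object(worker, "buffer")
tree = list(walk(root))
```

Under this layout, the k-th process or object created by a process during replay can be matched with the k-th pointer on that process's tape, which is the association the text describes.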

In  addition  to  the  information  recorded  on  a  process's  history  tape  regarding  interactions  with  shared 
objects  and  child  processes,  arbitrary  details  of  a  process's  execution  can  be  recorded  on  the  tape  for  use 
during  replay.  Specifically,  the  resolution  of  certain  interesting  events  can  be  recorded  on  the  history  tape 
in  order  to  replay  programs  containing  nondeterminism.  The  information  recorded  about  such  events  can 
be  used  to  recreate  the  same  event  during  program  replay.  A  mechanism  to  support  the  recording  of  these 
events  would  need  to  be  added  to  the  implementation  of  the  programming  language  at  the  appropriate 
level (i.e., compiler code generation or language run-time support). Such a mechanism would be
appropriate to record the statement alternative chosen in a nondeterministic selection statement, whether or
not a timeout interval had expired during execution, and clock values returned by system calls.
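
As a hedged illustration of recording the resolution of a nondeterministic event, the Python sketch below wraps a clock call so that the value returned during monitoring is written to the process's history tape and the same value is returned during replay. The names `make_clock` and `process`, the mode strings, and the use of a plain list as the tape are assumptions of the sketch, not part of the mechanism proposed here.

```python
import time

def make_clock(mode, tape):
    """Return a clock function that records to, or replays from, `tape` (a list)."""
    position = 0

    def clock():
        nonlocal position
        if mode == "monitor":
            value = time.time()      # the real, nondeterministic source
            tape.append(value)       # record how the event was resolved
        else:                        # mode == "replay"
            value = tape[position]   # recreate the same event
        position += 1
        return value

    return clock


def process(clock):
    """A toy process whose behavior depends on two clock readings."""
    start = clock()
    end = clock()
    return (start, end)


tape = []
first_run = process(make_clock("monitor", tape))
second_run = process(make_clock("replay", tape))
```

The same wrapping could record which alternative a selection statement chose or whether a timeout expired; in each case the process consults the tape instead of the environment during replay.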
2.4. Access Protocols for Shared Objects

In order to properly record a partial order of the accesses to each shared object, a protocol that
ensures a valid serialization is needed. In this section we will describe such a protocol, a
concurrent-read-exclusive-write (CREW) protocol that can be used to implement Instant Replay.

The  CREW  access  protocol  for  shared  objects  consists  of  four  procedures:  entry  and  exit  procedures 
for  readers,  and  entry  and  exit  procedures  for  writers.  During  the  monitoring  phase,  these  procedures 
enforce a CREW access protocol on shared objects and record a partial order of accesses to each shared
object. During the replay phase, these same procedures are used to enforce the partial order recorded
during the monitoring phase.

Each  process  that  reads  a  shared  object  must  use  the  entry  procedure  ReaderEntry  (Figure  1).  This 
routine  uses  a  semaphore  associated  with  the  object  to  ensure  that  the  process  is  momentarily  given 
exclusive access to the object. The reader process increments the count maintained in the current square of
the object history tape. This indicates that one more process has read this version of the object. The
reader also records the version number of the object on its process history tape. At the end of the entry
procedure the semaphore is released, allowing other readers to access the object concurrently. The reader
process may then perform one or more read operations on the shared object. Eventually, the exit routine
ReaderExit (also in Figure 1) is called, which simply decrements the number of readers for the object,

ReaderEntry(object, process):
    if mode = replay then
        key := ReadHistoryTape(process);
        while not(object.version = key) do delay;
    end if;
    P(object.lock);
    AtomicAdd(object.activeReaders, 1);
    if mode = replay then
        object.readercount += 1;
        V(object.lock);
    else
        IncHistoryTape(object);
        localVersion := object.version;
        V(object.lock);
        WriteHistoryTape(process, localVersion);
    end if;
    AdvanceHistoryTape(process);
end ReaderEntry;

ReaderExit(object):
    AtomicAdd(object.activeReaders, -1);
end ReaderExit;

Figure 1
thereby allowing writers a chance to proceed.5

In  replay  mode,  ReaderEntry  functions  a  little  differently.  Before  acquiring  the  semaphore,  a  reader 
must  wait  until  the  version  number  for  the  target  object  is  equal  to  the  version  number  recorded  on  the 
reader's  history  tape.  This  ensures  that  the  reader  will  see  the  correct  version  of  the  target  object  during 
replay.  A  separate  counter,  used  only  during  replay,  is  updated  to  reflect  the  number  of  readers  thus  far 
for the current version of the object. This counter allows writers to write the next version of an object only
when all readers have finished with the current version. The rest of the entry procedure proceeds as
before, except that history tapes are not written; they are merely read and advanced as execution proceeds.
The  same  exit  procedure  is  used  by  readers  during  the  monitoring  phase  and  replay. 

Each  process  that  modifies  a  shared  object  must  use  the  entry  procedure  WriterEntry  (Figure  2).  In 
this  routine,  the  writer  uses  a  semaphore  associated  with  the  object  to  gain  exclusive  access  to  the  object. 
Once  the  semaphore  is  acquired,  the  writer  process  waits  for  all  active  readers  to  finish.  No  new  readers 
can  access  the  object  since  the  entry  routine  for  a  reader  must  also  acquire  the  semaphore.  When  all 
readers  have  finished  with  the  object,  the  writer  is  free  to  access  the  current  version  of  the  object.  The 
writer  advances  the  object’s  history  tape  and  an  initial  value  of  0  is  written  onto  the  object’s  history  tape, 

WriterEntry(object, process):
    if mode = replay then
        key := ReadHistoryTape(process);
        while not(object.version = key) do delay;
        key := ReadHistoryTape(object);
        while not(object.readercount = key) do delay;
    end if;
    P(object.lock);
    while not(object.activeReaders = 0) do delay;
    AdvanceHistoryTape(object);
    if mode = replay then
        object.readercount := 0;
    else
        WriteHistoryTape(object, 0);
        WriteHistoryTape(process, object.version);
    end if;
    object.version += 1;
    AdvanceHistoryTape(process);
end WriterEntry;

WriterExit(object):
    V(object.lock);
end WriterExit;

Figure 2


5 We use atomic increment and decrement operations to maintain the count of readers for an object, thereby avoiding the need
for additional synchronization in the exit procedure.
representing the fact that no processes have read the upcoming version. Next, the writer copies the object's
current version number onto its process history tape as a record of the version that preceded the
modification. The writer completes the entry procedure by incrementing the object's version number. The
writer may then modify the shared object. Exclusive access is maintained because the semaphore is not
released until the exit procedure is called. The WriterExit routine (also in Figure 2) simply releases
exclusive access to the object by performing a V operation on the object’s semaphore.

In  replay  mode,  the  entry  routine  for  writers  also  functions  somewhat  differently.  Before  acquiring 
the  object's  semaphore,  the  writer  must  wait  until  the  current  version  of  the  object  matches  the  version 
number recorded on the writer’s history tape. This ensures that the writer modifies the correct version.
Next, the writer must make sure that the number of readers that have seen the current version of the object
during replay is equal to the number of readers that saw that version in the original execution. This
ensures that a version is not changed until all readers have seen it. The count of readers of the upcoming
version  is  initialized  to  0  and  the  version  number  is  incremented.  The  write  operation  can  now  proceed. 
The  WriterExit  procedure  is  the  same  for  both  monitoring  and  replay;  the  semaphore  is  released,  thereby 
terminating  exclusive  access  to  the  object. 
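
The reader and writer procedures of Figures 1 and 2 can be transcribed into a small Python sketch. This is not the Butterfly implementation: a `threading.Lock` stands in for the object's semaphore, a second lock emulates `AtomicAdd`, `time.sleep(0)` plays the role of `delay`, and the `Tape` and `SharedObject` classes are hypothetical helpers introduced for the example.

```python
import threading
import time

MONITOR, REPLAY = "monitor", "replay"


class Tape:
    """A history tape: a list of integer squares plus a head position."""

    def __init__(self, squares=None):
        self.squares = list(squares) if squares is not None else []
        self.head = 0


class SharedObject:
    def __init__(self, value, mode, tape=None):
        self.value = value
        self.mode = mode
        self.version = 0
        self.lock = threading.Lock()          # the object's semaphore
        self.count_lock = threading.Lock()    # emulates AtomicAdd
        self.active_readers = 0
        self.reader_count = 0                 # used only during replay
        # Object tape: one square per version, holding that version's reader count.
        self.tape = tape if tape is not None else Tape([0])


def _atomic_add(obj, delta):
    with obj.count_lock:
        obj.active_readers += delta


def reader_entry(obj, ptape):
    if obj.mode == REPLAY:
        key = ptape.squares[ptape.head]
        while obj.version != key:             # wait for the recorded version
            time.sleep(0)
    obj.lock.acquire()                        # P(object.lock)
    _atomic_add(obj, 1)
    if obj.mode == REPLAY:
        obj.reader_count += 1
        obj.lock.release()                    # V(object.lock)
    else:
        obj.tape.squares[obj.tape.head] += 1  # IncHistoryTape(object)
        local_version = obj.version
        obj.lock.release()                    # V(object.lock)
        ptape.squares.append(local_version)   # WriteHistoryTape(process, ...)
    ptape.head += 1                           # AdvanceHistoryTape(process)


def reader_exit(obj):
    _atomic_add(obj, -1)


def writer_entry(obj, ptape):
    if obj.mode == REPLAY:
        key = ptape.squares[ptape.head]
        while obj.version != key:             # wait for the recorded version
            time.sleep(0)
        key = obj.tape.squares[obj.tape.head]
        while obj.reader_count != key:        # wait for all recorded readers
            time.sleep(0)
    obj.lock.acquire()                        # P(object.lock)
    while obj.active_readers != 0:            # let active readers drain
        time.sleep(0)
    obj.tape.head += 1                        # AdvanceHistoryTape(object)
    if obj.mode == REPLAY:
        obj.reader_count = 0
    else:
        obj.tape.squares.append(0)            # WriteHistoryTape(object, 0)
        ptape.squares.append(obj.version)     # WriteHistoryTape(process, version)
    obj.version += 1
    ptape.head += 1                           # AdvanceHistoryTape(process)


def writer_exit(obj):
    obj.lock.release()                        # V(object.lock)
```

To replay a recorded run, reset every tape's `head` to 0, construct the `SharedObject` in `REPLAY` mode with the recorded object tape, and hand each thread the process tape it filled during monitoring.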

This description of a CREW access protocol is intended to be illustrative, not definitive. Instant
Replay requires neither a CREW protocol nor this particular implementation of a CREW protocol. As
stated previously, we could use an ME protocol to guarantee a valid serialization. In addition, a different
implementation, one that does not use shared memory, would probably be required for a loosely-coupled
system. For example, rather than accessing shared memory locations to read and record object status
information, some parts of the protocol could be implemented as remote operations. Regardless of the
characteristics of a particular implementation of the access protocols, our basic approach is to record a
partial order of operations on each shared object and ensure the same order during program replay.

3.  A  Multiprocessor  Prototype  of  Instant  Replay 

A  prototype  implementation  of  Instant  Replay  has  been  developed  for  the  BBN  Butterfly  Parallel 
Processor.  Several  considerations  motivated  the  choice  of  the  Butterfly  as  a  testbed.  First,  we  have  a 
Butterfly  at  the  University  of  Rochester,  but  lack  methods  and  tools  for  debugging  parallel  programs.  This, 
combined  with  the  current  surge  of  software  development  for  the  Butterfly,  created  an  urgent  need  we 
wanted  to  fulfill.  Second,  interprocess  communication  on  the  Butterfly  is  inexpensive,  which  tends  to 
encourage  development  of  communication-intensive  programs.  Third,  communication  on  the  Butterfly  is 
available  over  a  wide  range  of  granularities:  process  interactions  can  occur  through  direct  sharing  of 
memory, or through the use of higher-level primitives for message-passing. Finally, the high degree of
parallelism offered by the Butterfly provides a challenging test since highly parallel,
communication-intensive applications necessarily will experience the greatest performance degradation using any program
monitoring technique.


3.1.  The  BBN  Butterfly  Parallel  Processor 

The  BBN  Butterfly  Parallel  Processor  at  the  University  of  Rochester  consists  of  128  processing  nodes 
connected  by  a  switching  network.  Each  switch  node  in  the  switching  network  is  a  4-input,  4-output 
crossbar  switch  with  a  bandwidth  of  32  megabits/sec.  Each  processor  is  an  8  MHz  MC68000  with  24  bit 
virtual  addresses.  A  2901-based  bit-slice  co-processor  interprets  every  memory  reference  issued  by  the 
68000  and  is  used  to  communicate  with  other  nodes  across  the  switching  network.  All  the  memory  in  the 
system  resides  on  individual  nodes,  but  any  processor  can  address  any  memory  through  the  switch.  A 
remote memory reference (read) takes about 4 µs, roughly 5 times as long as a local reference.

Chrysalis  [19],  the  Butterfly  operating  system,  consists  largely  of  a  protected  subroutine  library  that 
implements  operations  on  a  set  of  primitive  data  types,  including  event  blocks  (structures  used  by  processes 
to  post  a  word  of  data  to  the  event  owner),  dual  queues  (queues  that  hold  a  sequence  of  long  word  data 
enqueued  by  processes,  or  alternatively,  a  sequence  of  process  handles  corresponding  to  processes  waiting 
to  dequeue  data  as  it  becomes  available),  shared  memory  segments,  and  a  global  name  table.  Objects  of 
these  types  can  be  shared  among  all  processes  executing  on  the  machine.  Low-level  operations  on  these 
data  types  are  provided  by  Chrysalis,  many  of  which  are  implemented  by  microcode.  These  primitive 
operations  provide  a  general  framework  upon  which  efficient  high-level  communication  protocols  and 
software systems can be built.

3.2.  Monitoring  Chrysalis  Operations 

Our  prototype  implementation  provides  programmers  with  encapsulated  versions  of  the  Chrysalis 
primitive  operations  on  events,  dual  queues,  shared  memory  objects,  and  processes.  The  encapsulated 
versions  of  the  Chrysalis  primitives  enforce  CREW  access  synchronization  and  record  a  partial  order  on  the 
operations  as  detailed  in  the  previous  section.  This  implementation  was  done  at  the  level  of  primitive 
Chrysalis operations to make replay available to all programs; it can be used in any software system
developed on top of the Chrysalis operating system. In particular, recent system development efforts at the
University of Rochester that can be easily modified to incorporate Instant Replay include LYNX, a
programming language and run-time system for distributed computing [22,23], and SMP, a message-passing
system that supports multicast message communication among groups of processes [15].

While  encapsulating  the  Chrysalis  primitives  for  events  and  dual  queues,  it  became  apparent  that 
requiring  a  CREW  protocol  for  all  operations  was  inappropriate.  Most  of  the  operations  on  events  and 
dual  queues  are  atomic,  which  means  that  the  operations  must  occur  serially  with  respect  to  their  target 
data  object  (a  characteristic  of  the  hardware).  The  CREW  protocol  allows  concurrent  readers  of  shared 
objects,  but  introduces  additional  cost.  Since  event  and  dual  queue  operations  cannot  exploit  concurrent 
execution  of  readers,  the  expense  of  the  CREW  protocol  is  not  justified.  By  replacing  the  CREW  protocol 
with the simpler mutual exclusion (ME) protocol, we force the serial execution of the Chrysalis event and
dual  queue  primitive  operations,  but  reduce  execution  overhead  by  simplifying  the  entry  and  exit  protocols. 
An  ME  protocol  also  eliminates  the  need  for  object  history  tapes  for  events  and  dual  queues  since  we  can 
record  a  total  order  using  only  process  history  tapes. 
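
A sketch of how an ME protocol can record a total order using only process history tapes, assuming each monitored object carries a version counter that every operation increments. The class `MonitoredQueue` is hypothetical, and it simplifies a dual queue by returning `None` on an empty dequeue rather than enqueueing the waiting process.

```python
import threading
import time
from collections import deque


class MonitoredQueue:
    def __init__(self, mode):
        self.mode = mode                  # "monitor" or "replay"
        self.items = deque()
        self.lock = threading.Lock()
        self.version = 0                  # incremented by every operation

    def _enter(self, ptape):
        if self.mode == "replay":
            key = ptape.pop(0)            # version recorded for this operation
            while self.version != key:    # wait for our turn in the total order
                time.sleep(0)
        self.lock.acquire()
        if self.mode != "replay":
            ptape.append(self.version)    # only the process tape is written

    def _exit(self):
        self.version += 1
        self.lock.release()

    def enqueue(self, ptape, item):
        self._enter(ptape)
        self.items.append(item)
        self._exit()

    def dequeue(self, ptape):
        self._enter(ptape)
        item = self.items.popleft() if self.items else None
        self._exit()
        return item
```

Since every operation is exclusive, each version number appears on exactly one process tape, so the process tapes taken together define a total order and no per-object reader counts are needed.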


Using  encapsulated  versions  of  Chrysalis  primitives  in  program  code  requires  no  additional  effort 
beyond  that  necessary  to  use  the  original  primitives.  Additional  program  code  is  only  necessary  for 
regulating  access  to  shared  memory  objects.  Chrysalis  provides  primitives  for  sharing  segments  of  memory. 
General  sharing  of  memory  objects  as  provided  by  the  Butterfly  hardware  and  Chrysalis  primitive 
operations  imposes  no  restrictions  on  memory  access  other  than  serializing  word  operations  on  each  node, 
since the memory hardware has only a single port. To guarantee that operations on such shared segments
conform  to  a  CREW  access  protocol,  it  is  necessary  to  use  access  entry  and  exit  routines  to  control  sharing 
of  these  segments.  The  programmer  can  control  the  granularity  of  operations  bracketed  by  the  access 
routines  in  response  to  performance  concerns.  By  controlling  the  cost  of  the  operations  within  an  entry  and 
exit  routine  pair,  the  programmer  can  balance  the  reduction  of  parallelism  incurred  when  locking  for  long 
periods  of  time  with  the  overhead  of  frequently  executing  the  locking  primitives.  (Since  the  access  protocol 
entry  and  exit  routines  have  a  small  critical  section  requiring  mutual  exclusion,  there  is  a  serial  nature  to 
their  execution.) 

3.3.  Case  Studies 

Two  applications  were  chosen  for  experiments  in  program  monitoring  and  replay:  computation  of  a 
knight’s  tour  of  a  chessboard  and  Gaussian  elimination.  The  knight's  tour  problem  was  chosen  because 
there  is  an  existing  implementation  on  the  Butterfly  that  exhibits  extremely  nondeterministic  behavior.  A 
parallel  implementation  of  Gaussian  elimination  was  chosen  for  study  since,  unlike  the  knight’s  tour 
problem, no matter what execution path occurs when the program is run, the overall amount of
computation  performed  by  the  program  is  constant.  Also,  our  implementation  of  Gaussian  elimination  has 
already  been  the  subject  of  a  thorough  performance  study  [14]  and  the  statistics  previously  obtained  about 
the  program's  execution  behavior  can  be  used  as  a  baseline  for  comparison  to  determine  the  cost  of  our 
monitoring  techniques. 

3.3.1. Knight’s Tour

A  knight’s  tour  is  a  path  on  a  chessboard  for  a  knight  that  successively  visits  each  square  once  and 
only  once  using  legal  chess  moves.  Our  program  to  compute  a  knight’s  tour  of  a  chessboard  consists  of  a 
master  process  and  a  user-specified  number  of  slave  processes.  The  master  selects  an  initial  position  of  the 
knight  on  the  chessboard  and  enters  the  corresponding  board  description  in  a  global  task  queue.  Next,  the 
master  creates  a  set  of  slave  processes  that  cooperate  to  search  for  a  knight’s  tour  beginning  with  the  initial 
board  position.  Each  slave  removes  a  set  of  board  descriptions  from  the  global  task  queue  and  replaces  it 
with  a  new  set  of  board  descriptions  that  could  be  generated  by  adding  a  legal  move  of  the  knight  from  its 
previous position. The order in which these board descriptions are added to and deleted from the task queue
determines the breadth and depth of the search performed. Since the order in which slave processes are
granted access to the task queue depends on the relative progress of the processes and resolution of
memory  contention  for  the  task  queue,  successive  executions  of  the  program  tend  to  produce  different 
tours. 

Calls  to  monitored  versions  of  the  task  queue  primitive  operations  (the  task  queue  is  a  dual  queue) 
were  inserted  in  the  program  in  place  of  the  original  calls  to  Chrysalis  primitives.  These  modifications  did 
not require substantial effort and caused no significant growth in code size. The effect on the performance
of each individual primitive is substantial, since the original primitives are implemented in microcode and
there is no such support for the history tape maintenance operations. However, the effect on overall
program performance is difficult to measure due to the inherent nondeterministic nature of the knight’s
tour  computation.  We  cannot  obtain  identical  executions  of  the  monitored  and  unmonitored  versions  of 
the  program  to  compare  execution  times  because  such  times  vary  wildly  between  successive  invocations  of 
the  program.  We  were  able  to  measure  accurately  the  comparative  execution  times  for  a  knight’s  tour 
program  during  the  monitoring  phase  and  the  replay  phase  of  the  same  execution.  The  difference  between 
the  two  execution  times  was  less  than  5%. 

Using  16  processors,  three  successive  executions  required  18,  38,  and  52  seconds  to  find  three 
different  solutions  for  a  5x5  chessboard;  the  executions  used  12K,  36K,  and  60K  bytes,  respectively,  for 
history  tapes.6  Using  64  processors,  a  solution  was  found  in  43  seconds  and  required  48K  bytes  for  history 
tapes.  It  is  not  surprising  that  the  amount  of  space  required  for  the  history  tapes  of  the  knight's  tour 
program  varies  with  the  amount  of  time  taken  to  find  a  solution.  Communication  is  roughly  a  constant 
percentage  of  the  computation  and  no  matter  how  many  processors  are  working  on  the  task, 
communication  speed,  hence  history  tape  space  requirements,  is  limited  by  the  need  to  serialize  access  to  a 
single shared task queue. We estimate that the knight’s tour program generates between 250 and 300
communication events per second; each communication event requires four bytes to record. From this we
can  estimate  the  space  requirements  for  the  history  tape  as  a  function  of  the  time  needed  to  find  a 
particular  solution. 
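
The estimate can be written out as a short calculation. The function name `tape_bytes` is hypothetical; the event rates and the four bytes per event are the figures quoted above.

```python
def tape_bytes(seconds, events_per_second, bytes_per_event=4):
    """Estimated history tape size for a run of the given length."""
    return seconds * events_per_second * bytes_per_event


for seconds in (18, 38, 52):
    low = tape_bytes(seconds, 250)
    high = tape_bytes(seconds, 300)
    print(f"{seconds} s: between {low} and {high} bytes")
```

This is only a rough guide: the measured sizes reported above (for example, 12K bytes for the 18-second run) show that the actual event rate varies from one execution to the next.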

3.3.2. Gaussian Elimination

To  obtain  an  empirical  comparison  of  the  relative  cost  of  monitored  and  unmonitored  program 
executions,  an  existing  program  to  solve  a  system  of  linear  equations  using  Gaussian  elimination  was 
instrumented.  In  Gaussian  elimination,  the  total  amount  of  work  performed  by  the  program  is 
independent  of  the  precise  ordering  of  interprocess  events  during  execution;  the  computation  for  each  pivot 
row  depends  on  a  fixed  number  of  other  rows. 

The  implementation  of  Gaussian  elimination  uses  a  broadcast  message-passing  system  as  the  basis  for 
communication  among  the  cooperating  processes  in  the  program.7  A  single  master  process  initializes  shared 
data  structures  and  then  spawns  worker  processes  to  diagonalize  the  matrix.  The  master  delegates  rows  of 
the  matrix  to  each  slave  process  participating  in  the  solution.  Each  time  the  processing  of  a  row  is 
completed,  the  contents  are  broadcast  by  the  process  holding  that  row  to  each  of  the  other  slaves. 

To  instrument  this  application  we  replaced  some  dual  queue  and  event  primitives  used  for 
synchronization  between  the  master  and  slaves  with  monitored  versions  of  the  Chrysalis  primitives.  The 

6 Our current implementation uses a 32 bit word for each entry on a history tape, although 16 bit words would suffice for our
case studies, as well as most other programs. Therefore, our space requirements are conservative and could easily be reduced by a
factor of 2.

7 The message-passing system used here is an early prototype of SMP [15]. The results described in this section are particularly
relevant to programs based on SMP, or similar communication models.




underlying message-passing system, however, required more extensive changes. Message-passing was
implemented  using  shared  memory  segments  as  communication  buffers.  Modifications  to  the  send  and 
receive  primitives  of  the  message-passing  system  were  required  to  enforce  the  CREW  access  protocols,  as 
detailed  in  section  2,  for  the  shared  communication  buffers. 

Although the code overhead and programming effort to make this transformation were more
substantial than that required for the knight's tour, the size of the effort was still small. The original
Gaussian elimination program contains 1059 lines of code. To instrument the program for Instant Replay,
24 lines of code were altered and 17 lines of code were added. Most of the changes to the source code files
occurred in the message-passing module. Figure 3 shows the skeletal form of the monitored
message-passing routines.

The performance of the Gaussian elimination implementation was degraded by the enforcement of a
CREW protocol on shared object access and recording the access order to shared objects. Figure 4 depicts
the performance of monitored and unmonitored versions of the application on a 400x400 matrix. The
unmonitored program improves dramatically in performance as additional processors become involved in
the computation; however, there is no significant improvement in performance when more than 32
processors are in use. In fact, performance begins to degrade slightly beyond 32 processors because the
additional communication involved is not justified by the gain in parallelism [14]. A naive placement of
unoptimized monitoring operations in this program resulted in severe degradation of performance when
more than 8 processors were in use, as shown in Figure 4. This demonstrates the importance of efficient
monitoring operations. Optimizing the monitoring code greatly improved the performance, but monitoring
still doubled the execution time of the program on 32 processors. Examination of the monitoring
cost showed that the program was spending a great deal of time monitoring and recording noncritical
polling operations on buffers. To lower the cost of monitoring, we devised a special entry procedure for

Send Message
    Find buffer
    WriterEntry(buffer, myprocess)
        Copy message into buffer
        Set number of recipients
    WriterExit(buffer)

Receive Message
    ReaderEntryPoll(buffers, myprocess)
        Poll incoming message buffers
        Copy message into user area
    ReaderExitPoll(buffers)
    WriterEntry(buffer, myprocess)
        Decrement number of recipients
    WriterExit(buffer)

Figure 3
use with the common programming idiom in which readers poll before reading a value.

Our implementation of message-passing uses polling to find incoming messages. Whenever a process
attempts to receive a message, a large number of buffers, one for each process in the computation, are
polled. Our naive placement of monitoring operations considered each polling operation as an access to a
shared object, which was duly recorded on the process's history tape. The realization that none of the
polling operations, except the last one, are necessary for replay led us to devise a special entry procedure
used in conjunction with polling. With this new entry procedure, the access ordering to a buffer is
recorded only when a message is found. An indication of which buffer supplied the message and the
version number for that buffer are recorded on the process's history tape. During replay, only the buffer
from which a process received a message during the monitoring phase is polled. Use of this entry
procedure eliminated recording of nonessential ordering information during the monitoring phase, saving
both time and storage space for the information collected. The performance of the program using the
special entry procedure is also shown in Figure 4. The result: we were able to monitor a
communication-intensive application for replay by imposing overhead of less than 10%.
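
The idea behind the polling entry procedure can be sketched as follows. This is a simplified, single-threaded illustration that leaves out the CREW locking and reader counts entirely; `Buffer` and `receive_poll` are hypothetical names, and the tape entry is modeled as a (buffer index, version) pair.

```python
class Buffer:
    def __init__(self):
        self.version = 0
        self.message = None


def receive_poll(buffers, ptape, mode):
    """Return (index, message) if a message is available, otherwise None."""
    if mode == "replay":
        index, version = ptape[0]         # peek at the recorded successful poll
        buf = buffers[index]
        if buf.message is None or buf.version != version:
            return None                   # not ready yet; the caller polls again
        ptape.pop(0)
        return index, buf.message
    for i, buf in enumerate(buffers):
        if buf.message is not None:
            ptape.append((i, buf.version))   # record only the successful poll
            return i, buf.message
    return None                           # unsuccessful polls leave no trace


# Monitoring: the first poll finds nothing and records nothing.
buffers, tape = [Buffer() for _ in range(3)], []
miss = receive_poll(buffers, tape, "monitor")
buffers[2].message, buffers[2].version = "row 7", 1
hit = receive_poll(buffers, tape, "monitor")
```

During replay only the recorded buffer is examined, so a message that happens to arrive earlier in some other buffer cannot be received out of order.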

As we have already stated, Gaussian elimination is a communication-intensive program, which tends
to produce large history tapes. Diagonalization of an 800x800 matrix on 64 processors requires 400KB for
the history tapes. While this is not a small amount of space, it is worth comparing the space requirements
for our method with other techniques that save the contents of every message received by a process in an
event log. Such an approach requires over 150MB of space! In general, Instant Replay will always take
less space than an event log whenever large messages are involved, since we only require between 4 and 8
bytes for each message.8
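
A back-of-the-envelope comparison, under assumptions that are not stated in the text: each of the 800 rows is broadcast once to each of the other 63 processes, each history tape entry takes 8 bytes, and each matrix element occupies 4 bytes. The function name `replay_vs_log` is hypothetical.

```python
def replay_vs_log(n, processes, bytes_per_tape_entry=8, bytes_per_element=4):
    """Return (messages received, history tape bytes, event log bytes)."""
    messages = n * (processes - 1)             # each row goes to every other process
    tape = messages * bytes_per_tape_entry     # ordering information only
    log = messages * n * bytes_per_element     # full row contents per message
    return messages, tape, log


messages, tape, log = replay_vs_log(800, 64)
print(f"{messages} messages: tape {tape / 1e3:.0f} KB, event log {log / 1e6:.0f} MB")
```

Under these assumptions the history tapes come to roughly 400 KB while a log of message contents is on the order of 160 MB, which is consistent with the magnitudes quoted above.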

4.  Instant  Replay  in  the  Debugging  Cycle 

Program  replay  makes  it  possible  to  repeat  the  execution  of  a  parallel  program  as  often  as  desired. 
Unfortunately, Instant Replay does not automatically debug programs, parallel or otherwise. How then do
we use the replay capability to debug parallel programs? In this section we describe several techniques for
error isolation that can be used together with our approach. We have already used some of these
techniques in our own work; others require the cooperation of additional tools that we have not yet
developed.

Our  goal  is  to  provide  repeatable  execution,  so  that  it  is  possible  to  observe  the  same  execution  of  a 
parallel  program  over  and  over.  Any  results  that  may  have  been  ignored  during  previous  observations  can 
always  be  reproduced  on  demand  for  closer  examination.  This  capability  is  especially  useful  for  parallel 
programs  since  (a)  multiple  processes  tend  to  generate  a  lot  of  output,  making  it  easy  to  miss  important 
results  and  (b)  the  programming  environments  for  parallel  architectures  are  not  as  mature  as  the 
programming  environments  for  sequential  machines,  and  often  lack  tools  for  collecting  and  analyzing 

8 Normally, four bytes per message are needed; however, the polling entry procedure used by Gaussian elimination requires 8
bytes.


output data. However, the most important reason for reproducible behavior is that it makes cyclic
debugging  possible. 

The  simplest  form  of  cyclic  debugging  is  to  add  output  statements  to  an  erroneous  program  that 
provide  additional  details  about  the  execution  of  the  program.  Successive  executions  can  be  used  to 
provide  successively  greater  detail  about  those  parts  of  the  program  under  suspicion.  This  technique  does 
not  work  with  parallel  programs  in  general  because  the  output  statements  can  change  the  relative  timing  of 
operations  within  the  program  and  yield  a  different  execution  sequence.  With  Instant  Replay,  however, 
any  number  of  output  statements  can  be  added  to  the  program  without  changing  the  execution  sequence 
provided  by  the  replay  mechanism.  In  fact,  any  type  of  statement  may  be  added  to  the  program  during 
replay,  as  long  as  the  additions  do  not  affect  the  sequence  of  interactions  with  shared  objects  by  each 
process.  Thus,  the  programmer  can  debug  parallel  programs  by  adopting  the  same  cyclic  methodology  for 
error  isolation  used  in  debugging  sequential  programs.  We  have  found  that  this  capability  alone  is  a 
valuable tool for debugging parallel programs, particularly in the absence of other debugging tools.

Repeatable  execution  also  makes  top-down  debugging  possible.  Hierarchical  abstraction  of  detail  is 
necessary  to  cope  with  the  complexity  of  large  software  systems.  Abstraction  is  particularly  important  in 
understanding the behavior of parallel programs. The programmer should not have to be concerned with
the  low-level  details  of  execution  of  a  parallel  program,  such  as  the  interleaving  of  primitive  operations. 
Instead,  we  are  interested  in  the  salient  features  of  the  execution  that  characterize  its  behavior.  Our 
approach  allows  the  programmer  to  start  with  a  high-level  view  of  a  program's  behavior,  produced  by 
normal  output  statements  or  an  event  mechanism  similar  to  Behavioral  Abstraction.  By  carefully  refining 
that  viewpoint,  based  on  the  information  made  available  during  successive  replay,  the  programmer  can 
study  erroneous  behavior  at  any  level  of  detail  desired.  As  a  result,  one  can  diagnose  program  errors  in  a 
top-down  fashion  without  wading  through  voluminous,  irrelevant  detail  at  each  step. 

Another  common  technique  used  to  debug  sequential  programs  is  breakpoint  insertion.  Breakpoints 
are  added  to  the  program  at  interesting  points  in  the  code.  Execution  is  suspended  at  each  breakpoint, 
allowing the programmer to examine the system state. Breakpoints only suspend a single thread of
execution, however, which is not sufficient for parallel programs consisting of multiple threads of execution.
Inserting a breakpoint in one process of a parallel program will have an effect on every process that
communicates, directly or indirectly, with the suspended process. In particular, breakpoints can change the
relative order of events during execution, producing a different execution sequence each time. Fortunately,
we can provide reproducible execution even in the presence of breakpoints. No matter how many
breakpoints are encountered during replay, we continue to order operations based on the contents of
history tapes. A process that is suspended by a breakpoint will eventually cause all other processes to wait
for  some  shared  object  to  be  read  or  written  (assuming  a  connected  graph  of  process  interactions).  When 
the  suspended  process  is  allowed  to  continue  beyond  the  breakpoint,  it  will  eventually  catch  up  to  the  other 
processes  and  the  entire  program  will  continue  executing.  Thus,  it  is  possible  to  cycle  through  breakpoints 
in  many  different  processes  during  program  replay,  examining  system  state  for  a  different  process  at  each 
breakpoint.

This use of breakpoints also allows the programmer to examine the global state of the computation.
Due  to  communication  delays  and  a  reliance  on  local  viewpoints,  it  is  impossible  to  take  an  instantaneous 
snapshot of global state. However, all we really need to see are meaningful global states [4], consistent
states based on the happened before ordering of Lamport [11]. For example, if we suspend a process P at
breakpoint  X,  all  events  that  occurred  before  P  reached  X  should  be  reflected  eventually  in  all  other 
processes.  In  addition,  other  processes  should  not  be  allowed  to  proceed  beyond  any  point  that  requires 
process  P  to  proceed  beyond  X.  This  view  of  a  computation  is  the  best  we  can  hope  for  since,  if  all 
processes  are  stopped  as  the  result  of  setting  a  single  breakpoint,  the  happened  before  relation  cannot 
distinguish  between  the  global  state  represented  by  all  suspended  processes  and  an  omniscient  snapshot  of 
the global state during normal execution. We provide exactly this notion of global state, and any notion
that  attempts  to  be  more  precise  is  not  likely  to  be  meaningful  in  a  distributed  system. 

We  can  use  breakpoints,  in  conjunction  with  Instant  Replay,  to  provide  the  ability  to  halt  distributed 
programs  in  a  consistent  state,  as  in  [18],  without  the  need  for  additional  mechanisms.  By  setting  a  local 
breakpoint  during  replay  we  are,  in  effect,  setting  a  breakpoint  in  the  global  state.  When  the  local 
breakpoint  is  reached,  we  can  see  the  exact  state  of  the  local  process  containing  the  breakpoint,  and  the 
exact  state  of  all  other  processes  as  they  block  due  to  enforcement  of  the  happened  before  relation. 
Differences  between  the  state  of  each  process  in  an  instantaneous  snapshot  and  what  we  see  at  a  breakpoint 
reflect  the  natural  degree  of  asynchrony  between  processes  in  the  program. 
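
The blocking behavior described above can be illustrated with a minimal sketch. It assumes a simplified version-number protocol in which each recorded read names the version it must observe and each recorded write names the version it replaces and the number of readers of that version. The class ReplayObject, the function run_replay, and the use of a threading.Event as a stand-in for a debugger breakpoint are hypothetical names chosen for illustration; they are not taken from the prototype.

```python
import threading
import time


class ReplayObject:
    """Shared object whose accesses are ordered by recorded version numbers."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.readers_done = 0          # readers that have seen the current version
        self.cond = threading.Condition()

    def read(self, expected_version):
        with self.cond:
            # Block until the object reaches the version recorded on the tape.
            self.cond.wait_for(lambda: self.version == expected_version)
            self.readers_done += 1
            self.cond.notify_all()
            return self.value

    def write(self, expected_version, expected_readers, value):
        with self.cond:
            # Block until the recorded version is current and all of its
            # recorded readers have finished.
            self.cond.wait_for(
                lambda: self.version == expected_version
                and self.readers_done == expected_readers
            )
            self.value = value
            self.version += 1
            self.readers_done = 0
            self.cond.notify_all()


def run_replay(breakpoint_delay):
    obj = ReplayObject("init")
    resume = threading.Event()         # stands in for a debugger breakpoint
    seen = []

    def p():
        resume.wait()                  # P is suspended at breakpoint X
        obj.write(expected_version=0, expected_readers=0, value="from P")

    def q():
        seen.append(obj.read(expected_version=1))

    threads = [threading.Thread(target=p), threading.Thread(target=q)]
    for t in threads:
        t.start()
    time.sleep(breakpoint_delay)       # programmer inspects state at X
    q_blocked = threads[1].is_alive()  # Q cannot pass the read that depends on P
    resume.set()                       # continue beyond the breakpoint
    for t in threads:
        t.join()
    return q_blocked, seen
```

In this sketch, however long P is held at the breakpoint, Q stays blocked at the read that depends on P's write and then observes the same value as in the recorded execution.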

A  consequence  of  our  breakpoint  capability  is  the  ability  to  support  single-step  execution  of 
processes.  Single-step  execution  can  be  used  during  debugging  to  trace  the  state  transitions  of  an 
individual  process  or  the  effects  of  interprocess  communication  on  the  internal  states  of  communication 
partners.  We  can  replay  a  process  using  single-step  execution  because  enforcement  of  the  happened  before 
relation  ensures  that  asynchrony  between  processes  remains  within  allowable  bounds. 

Instant  Replay  can  also  be  used  in  conjunction  with  an  event  log  technique  to  allow  repeatable 
execution of a subset of processes involved in a computation. As we have described it, our approach
requires  that  the  input  to  each  process  be  recomputed  during  replay,  rather  than  retrieved  from  an  event 
log.  This  is  both  an  advantage  and  a  disadvantage.  While  our  technique  requires  less  time  and  space 
during  the  monitoring  phase,  it  also  requires  that  all  processes  be  reexecuted  during  replay.  Global  replay 
is  a  disadvantage  if  the  computational  requirements  to  replay  a  program  are  very  large,  particularly  when  it 
is  unnecessary  to  recreate  the  entire  original  set  of  processes  to  isolate  an  error.  By  using  an  event  log 
together  with  Instant  Replay,  we  can  reexecute  the  subset  of  processes  in  which  we  are  interested  and 
simulate the rest.

There  is  a  tradeoff  between  the  expense  of  maintaining  an  event  log  during  normal  execution  and  the 
expense  of  reexecuting  all  processes  during  replay.  The  event  log  approach  and  Instant  Replay  represent 
two  extremes,  wherein  the  expense  is  shifted  from  the  monitoring  phase  to  the  replay  phase.  However,  a 
compromise  between  our  technique  and  the  event  log  approach  is  possible.  When  frequent  replay  of  a 
subset  of  processes  in  a  computation  is  desired,  as  would  be  the  case  when  using  cyclic  debugging  to  isolate 
errors,  it  is  possible  to  collect  additional  information  in  an  event  log  during  replay  that  would  eliminate  the 
need for reexecution of the entire program during subsequent replay. We can record in an event log all
external inputs to the subset of processes of interest. This record would include both inputs from the
external  environment  and  inputs  from  processes  not  under  scrutiny.  Interactions  involving  processes  to  be 
reexecuted  during  replay  are  recorded,  as  before,  as  partial  orders  on  history  tapes.  On  subsequent 
executions,  only  the  designated  subset  of  processes  would  be  reexecuted  and  their  interface  with  the 
external  environment,  including  the  other  processes,  would  be  simulated  using  the  event  log.  Since  we 
assume  that  the  debugging  methodology  is  cyclic,  the  set  of  processes  that  are  simulated  by  an  event  log 
will grow larger as we look at fewer processes in greater detail (i.e., top-down debugging). Note, however,
that  we  would  continue  to  use  Instant  Replay  in  the  monitoring  phase  because  it  has  the  least  impact  on 
normal  program  execution  and  can  be  used  to  generate  event  logs  during  the  debugging  cycle.9 
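
The compromise described above might be sketched as follows. This is an illustrative toy, not the prototype's mechanism: the functions consumer, full_replay, and partial_replay are hypothetical names, and a simple generator stands in for reexecuting a process that is not under scrutiny.

```python
def consumer(receive):
    """Process under scrutiny: sums three inputs obtained through `receive`."""
    total = 0
    for _ in range(3):
        total += receive()
    return total


def full_replay():
    """Reexecute every process, logging the inputs delivered to the consumer."""
    produced = (x * x for x in (1, 2, 3))   # stands in for reexecuting the producer
    event_log = []

    def receive():
        value = next(produced)
        event_log.append(value)             # record the external input
        return value

    return consumer(receive), event_log


def partial_replay(event_log):
    """Reexecute only the consumer; the producer is simulated from the log."""
    logged = iter(event_log)
    return consumer(lambda: next(logged))


result, log = full_replay()
assert partial_replay(log) == result
```

The event log is produced during a replay rather than during the original monitored run, so the cost of logging data values is paid only once the programmer has chosen which processes to examine.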

Finally, we can use Instant Replay, together with techniques developed by Miller [16, 17], for both
causal  analysis  and  performance  monitoring  of  parallel  programs.  These  techniques  use  a  program  history 
graph  containing  a  partial  order  of  all  events  in  a  parallel  program  to  analyze  the  behavior  of  the  program. 
It  is  possible  to  change  some  aspects  of  the  history  graph  to  analyze  the  effect  of  changes  in  the  execution 
environment [17]; however, there is no guarantee that modifying system parameters, such as expected
communication  delay  and  processor  load,  will  not  change  the  execution  behavior  of  the  program.  By  using 
Instant  Replay  to  guarantee  repeatable  execution  behavior,  it  is  possible  to  change  cost  labels  in  the  history 
graph  and  replay  the  program  under  new  assumptions,  without  changing  the  execution  behavior  of  the 
program.  (Of  course,  the  replay  mechanism  would  have  to  be  modified  to  incorporate  changes  to  the 
history graph, such as the message delay time.) In particular, one could examine the effect of
communication  costs  on  overall  program  performance  by  artificially  varying  the  delay  associated  with 
communication.  Any  performance  results  derived  from  such  an  exercise  would  be  just  estimates,  but  it  is 
still  possible  to  learn  a  great  deal  about  parallel  programs  using  these  techniques. 
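
As a rough illustration of varying cost labels, the sketch below treats a program history graph as a directed acyclic graph of events and estimates completion time as the longest path through it. The edge format, the function completion_time, and the example graph EDGES are assumptions made for this sketch and are not drawn from [16, 17]; the graphlib module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def completion_time(edges, message_delay):
    """Longest path through a DAG of events.

    Each edge is (src, dst, cost); a cost of None marks a message edge
    whose cost label is replaced by `message_delay`.
    """
    preds = {}
    for src, dst, cost in edges:
        preds.setdefault(src, [])
        preds.setdefault(dst, []).append((src, cost))

    order = TopologicalSorter(
        {node: [s for s, _ in p] for node, p in preds.items()}
    ).static_order()

    finish = {}
    for node in order:
        # An event occurs once its latest predecessor, plus edge cost, is done.
        finish[node] = max(
            (finish[s] + (message_delay if c is None else c)
             for s, c in preds[node]),
            default=0,
        )
    return max(finish.values())


# Two processes P and Q; P sends one message to Q after its first step.
EDGES = [
    ("p0", "p1", 5), ("p1", "p2", 2),   # computation in P
    ("q0", "q1", 1), ("q1", "q2", 4),   # computation in Q
    ("p1", "q1", None),                 # message from P to Q
]

assert completion_time(EDGES, message_delay=1) == 10
assert completion_time(EDGES, message_delay=10) == 19
```

Because the partial order of events is held fixed, only the estimated times change with the delay; as noted above, such figures are estimates rather than measurements.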

5.  Conclusions  and  Future  Work 

One  of  the  most  important  tools  for  analyzing  and  debugging  software  is  the  interactive  debugger. 
Cyclic  debugging  with  an  interactive  debugger  requires  the  ability  to  reproduce  program  behavior  on 
demand.  We  have  described  the  design  and  implementation  of  a  system  for  reproducible  execution  of 
parallel  programs.  In  summary,  Instant  Replay: 

•  provides  reproducible  execution  of  parallel  programs 

•  is  not  dependent  on  any  particular  form  of  interprocess  communication 

•  makes  possible  global  replay  of  programs,  rather  than  processes 

•  introduces  no  centralized  bottleneck,  either  during  monitoring  or  replay 

•  does  not  require  synchronized  clocks  or  globally-consistent  logical  time 

•  allows  modifications  to  programs  during  the  debugging  cycle 

•  has  only  minor  impact  on  program  performance  during  the  monitoring  phase 

•  has  reasonable  space  requirements 

•  is  applicable  to  both  loosely-coupled  and  tightly-coupled  environments 

9 In extraordinary circumstances where even a single replay is impractical, process history tapes and a partial event log could
both be recorded during the monitoring phase.


There  are  two  potential  disadvantages  to  our  approach.  First,  we  record  a  version  number  for  each 
access to a shared object. If the granularity of communication is very small (e.g., one-byte messages), we
could use less space by simply storing data values (i.e., the event log approach). Second, we require that all
processes  in  a  program  reexecute  during  replay.  Even  though  we  have  shown  how  to  use  event  logs  to 
eliminate  some  processes  during  successive  replays,  no  iterative  technique  is  well-suited  to  programs  that 
are  impractical  to  reexecute. 

Nevertheless,  our  experience  has  shown  that  Instant  Replay  is  effective,  efficient,  and  practical. 
Additional  experience  with  our  technique  is  necessary,  however.  We  must  perform  further  empirical 
studies  to  determine  the  performance  cost  of  our  monitoring  technique  on  other  programming 
environments.  Specifically,  we  intend  to  explore  applications  of  our  techniques  to  message-based 
communication  in  loosely-coupled  systems  and  lightweight  tasks  and  shared  memory  in  tightly-coupled 
systems.  Our  case  studies,  while  very  different  in  programming  style,  do  not  address  all  of  the 
programming  models  we  wish  to  support. 

Several  optimizations  to  reduce  further  the  time  and  space  needs  of  our  technique  are  also  under 
consideration.  An  example  of  such  an  optimization  was  described  in  Section  3.  Other  optimizations  based 
on  similar  idempotent  operations  are  possible.  Another  interesting  optimization  is  based  on  the  observation 
that some parallel programs (or segments of programs) are deterministic. The Gaussian elimination
program  is  a  good  example.  The  processes  that  perform  Gaussian  elimination  proceed  in  lockstep:  no 
monitoring  operations  are  necessary  to  reproduce  behavior.  It  is  possible  to  reduce  contention  and  space 
needs  for  monitoring  if  we  can  determine  that  some  sequence  of  interprocess  operations  yields  a 
deterministic schedule. Clearly, this information is application-specific and may only be obtainable with
programmer  assistance.  Nonetheless,  this  approach  is  worth  exploring  for  large  parallel  systems  with 
deterministic  components. 

Finally,  we  intend  to  explore  the  impact  of  Instant  Replay  on  the  development  of  a  general-purpose 
programming environment for parallel architectures. Additional tools will be constructed as a part of any
such environment (e.g., source-level single-process debuggers for parallel programs, tools to monitor
execution  with  graphical  displays,  compilers  to  automatically  instrument  programs),  and  we  will  want  to 
integrate  our  program  replay  capability  with  those  tools  as  they  are  developed. 